Virtual target addressing during direct data access via VF of IO storage adapter

ABSTRACT

A method of virtual machine (VM) access to physical storage through a direct path to a virtual function (VF) of a storage adapter, the method for use in a system that includes a host computing machine configured to implement a virtualization intermediary and the virtual machine (VM) and that includes the storage adapter, the method comprising: sending a virtual SCSI IO request from the VM to the physical storage that identifies a virtual disk address; mapping within the VF the identified virtual disk address to at least one physical region of the physical storage; creating within the VF a physical SCSI IO request that identifies a physical address for the mapped-to physical region; and sending the physical SCSI IO request from the VF to the physical storage.

CROSS REFERENCE TO RELATED APPLICATION

The subject matter of this application is related to that of commonly owned patent application Ser. No. 12/689,162, entitled “Configuring VM and IO Storage Adapter VF for Virtual Target Addressing During Direct Data Access,” filed on even date herewith.

BACKGROUND

A host computer system may run multiple virtual machines (VMs) that share common resources such as physical storage. Physical storage used by the VMs typically is emulated so as to present virtual storage resources to the VMs. A virtualization intermediary manages interaction between VMs and physical storage. Some prior virtualization intermediaries “trap” (intercept) virtual storage requests issued by individual VMs and redirect the requests from virtual targets to physical targets. Such an earlier virtualization intermediary uses trap handlers during emulation to redirect IO commands to prevent storage access violations. However, this emulation can be expensive in terms of instructions processed. Overall performance may decline when many VMs seek to access physical storage at the same time. The many storage requests can result in data access delays due to the compute cycles required by the virtualization intermediary to trap and translate simultaneous data requests from many VMs.

One solution to this problem has been proposed in the Single Root I/O Virtualization and Sharing Specification, Revision 1.0, Sep. 11, 2007 (PCI-SIG SR-IOV). The PCI-SIG SR-IOV specification proposes providing each of one or more VMs with direct access to physical storage through its own storage adapter instance, a designated virtual function (VF) running on a physical storage adapter, so as to avoid the need for heavy intervention by the virtualization intermediary to gain access to physical storage.

Unfortunately, direct access that bypasses the virtualization intermediary can result in loss of virtualization intermediary resident storage IO features such as virtual disk based provisioning.

SUMMARY

In one aspect, a virtual machine (VM) uses virtual addressing to access physical storage through a direct path to a virtual function (VF) of a storage adapter. The method is for use in a system that includes a host computing machine configured to implement a virtualization intermediary and the virtual machine (VM) and that includes the storage adapter. A virtual SCSI IO request that identifies a virtual disk address is sent from a VF driver of the VM to the VF. Within the VF, the identified virtual disk address is mapped to at least one physical region of the physical storage. A physical SCSI IO request that identifies a physical address for the at least one mapped-to physical region is created within the VF. The physical SCSI IO request is sent from the VF to the physical storage. Thus, virtual addressing is employed on a direct access path between a VM and a VF of the storage adapter.

In another aspect, certain error conditions detected in a virtual SCSI IO request are reported from the VF to the virtualization intermediary. The virtualization intermediary corrects certain error conditions and reports the correction to the VF. For example, when the VF determines that a virtual disk address within a virtual SCSI IO request refers to an unallocated region of the virtual disk, the VF reports the error condition to the virtualization intermediary. The virtualization intermediary changes the allocation of physical storage to the virtual disk and reports the change to the VF. Thus, the virtualization intermediary intervenes in SCSI IO operations on the direct access path under certain error conditions, but otherwise, the VM generally communicates with the VF over the direct access path without intervention by the virtualization intermediary.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustrative drawing showing one possible arrangement of a computer system that implements virtualization.

FIG. 2 is an illustrative drawing of a virtualized system including SR-IOV virtualization.

FIG. 3 is an illustrative drawing of a system that includes a host machine that hosts a virtualization intermediary and a virtual machine and that is coupled to access physical storage through an IOV adapter.

FIGS. 4A-4D are illustrative drawings that show a process to provision and instantiate the virtualized computer resources of the system of FIG. 3.

FIG. 5 is an illustrative transition diagram that illustrates process flow during a successful IOV Read/Write operation by the system of FIG. 3.

FIG. 6 is an illustrative transition diagram that illustrates process flow during an IOV Read/Write operation by the system of FIG. 3 in which an error is identified by a virtual function.

FIG. 7 is an illustrative drawing of the mapping process run within a VF to track physical SCSI commands associated with a virtual SCSI command for completion processing and for processing SCSI command aborts and SCSI LUN, target, and bus reset task management operations.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The following description is presented to enable any person skilled in the art to create and use a computer system configured for use with an IOV adapter in which a virtual logical unit can be used to access physical storage via a virtual function, and is provided in the context of particular uses and their requirements. Various modifications to the preferred embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Moreover, in the following description, numerous details are set forth for the purpose of explanation. However, one of ordinary skill in the art will realize that the invention might be practiced without the use of these specific details. In other instances, well-known structures and processes are shown in block diagram form in order not to obscure the description of the invention with unnecessary detail. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

In this description, reference is sometimes made to a virtual machine, a hypervisor kernel, virtual machine monitors (VMMs), a virtualization intermediary or some other virtualized component taking some action. Persons skilled in the art will appreciate that a hypervisor kernel, VMMs and a virtualization intermediary comprise one or more software layers that run on a host system, which comprises hardware and software. In order to perform any given action, a virtual machine, virtualization intermediary or other virtualized component configures physical resources of the host machine to perform the given action. For example, a virtualization intermediary may configure one or more physical processors, according to machine readable program code stored in a machine readable storage device, to re-assign physical processors allocated to I/O management.

Overview of Virtualization

FIG. 1 is an illustrative drawing showing one possible arrangement of a computer system 100 that implements virtualization. As is well known in the field of computer science, a virtual machine involves a “virtualization” in which an actual physical machine is configured to implement the behavior of the virtual machine. In the example system of FIG. 1, multiple virtual machines (VMs) or “guests” VM1 to VMn are installed on a “host platform,” referred to as a “host,” which includes system hardware, that is, hardware platform 104, and one or more layers of co-resident software components comprising a virtualization intermediary, e.g. a virtual machine monitor (VMM), hypervisor or some combination thereof. The system hardware typically includes one or more processors 106, memory 108, some form of mass storage 110, and various other devices 112, such as an IO storage adapter to perform protocol conversions required to access remote storage such as within a storage area network (SAN) 113 and to coordinate concurrent accesses to such storage.

Each virtual machine VM1 to VMn typically will have both virtual system hardware 114 and guest system software 116. The virtual system hardware typically includes one or more virtual CPUs (VCPUs) 116-1 to 116-m, virtual memory 118, at least one virtual disk 122, and one or more virtual devices 120. The virtual hardware components of the virtual machine may be implemented in software using known techniques to emulate the corresponding physical components. The guest system software includes guest operating system (OS) 124 and drivers 126 as needed for the various virtual devices 120. In many cases, software applications 128 running on a virtual machine VM1 will function as they would if run on a “real” computer, even though the applications are running at least partially indirectly, that is via guest OS 124 and virtual processor(s). Executable files will be accessed by the guest OS from virtual disk 122 or virtual memory 118, which will correspond to portions of an actual physical disk 110 or memory 108 allocated to that virtual machine.

A software component referred to herein as a ‘virtualization intermediary’ serves as an interface between the guest software within a virtual machine and the various hardware components and devices in the underlying hardware platform. The virtualization intermediary may include VMMs, a hypervisor (also referred to as a virtualization “kernel”) or some combination thereof. Because virtualization terminology has evolved over time and has not yet become fully standardized, these three terms do not always provide clear distinctions between the software layers and components to which they refer. In some systems, some virtualization code is included in at least one “superior” virtual machine to facilitate the operations of other virtual machines. Furthermore, specific software support for virtual machines may be included in the host OS itself. For example, the term ‘hypervisor’ often is used to describe both a VMM and a kernel together, either as separate but cooperating components or with one or more VMMs incorporated wholly or partially into the hypervisor itself to serve as a virtualization intermediary. However, the term hypervisor also is sometimes used instead to mean some variant of a VMM alone, which interfaces with some other software layer(s) or component(s) to support the virtualization.

One use of the term hypervisor signifies a software layer implemented to manage physical resources, process creation, I/O stacks, and device drivers. Under such an implementation, the hypervisor 132 would manage the selection of physical devices and their temporary assignment to virtual devices. For example, the hypervisor 132 would manage the mapping between VM1-VMn and their VCPUs 116-1 to 116-m, virtual memory 118, and the physical hardware devices that are selected to implement these virtual devices. More particularly, when a virtual processor is dispatched by a VM, a physical processor, such as one of the physical processors 106, would be scheduled by the hypervisor 132 to perform the operations of that virtual processor. In contrast, in the context of such an implementation, VMM1-VMMn might be responsible for actually executing commands on physical CPUs, performing binary translation (BT) or programming of virtual hardware, for example. Note that the VMM is ‘instanced,’ meaning that a separate instance of the VMM is created for each VM. Thus, although in this example such a hypervisor and a VMM may be distinct, they would work together as a virtualization intermediary. Unless otherwise indicated, the term ‘virtualization intermediary’ encompasses any combination of VMM and hypervisor (or hypervisor kernel) that provides a virtualization layer between a guest OS running on VMs and the host hardware.

In the system of FIG. 1, the virtual machine monitors VMM1 to VMMn are shown as separate entities from the hypervisor kernel software 132 that run within VM1 to VMn, respectively. The VMMs of the system of FIG. 1 emulate virtual system hardware. While the hypervisor 132 is shown as a software layer located logically between all VMs and the underlying hardware platform and/or system-level host software, it would be possible to implement at least part of the hypervisor layer in specialized hardware. The illustrated embodiments are given only for the sake of simplicity and clarity and by way of illustration since, as mentioned above, the distinctions are not always so clear-cut. Again, unless otherwise indicated or apparent from the description, it is to be assumed that one or more components of the virtualization intermediary can be implemented anywhere within the overall structure of such virtualization intermediary, and may even be implemented in part with specific hardware support for virtualization.

The various virtualized hardware components of the VM1, such as VCPU(s) 116-1 to 116-m, virtual memory 118, virtual disk 122, and virtual device(s) 120, are shown as being emulated within VMM1, which runs within virtual machine VM1. One advantage of such an arrangement is that the virtual machine monitors VMM1 to VMMn may be set up to expose “generic” devices, which facilitate VM migration and hardware platform-independence. For example, the VMM1 may be set up to emulate a standard Small Computer System Interface (SCSI) disk, so that the virtual disk 122 appears to the VM1 to be a conventional SCSI disk connected to a conventional SCSI adapter, whereas the underlying, actual, physical disk 110 may be something else. The term “disk” typically signifies persistently stored data addressed in sequence, typically from address zero to address max capacity-1. In that case, a conventional SCSI driver typically would be installed into the guest OS 124 as one of the drivers 126. A virtual device 120 within the VMM then would provide an interface between VM1 and a physical driver 126 that is part of the host system and would handle disk operations for the VM1.

Different systems may implement virtualization to different degrees—“virtualization” generally relates to a spectrum of definitions rather than to a bright line, and often reflects a design choice with respect to a trade-off between speed and efficiency on the one hand and isolation and universality on the other hand. For example, “full virtualization” is sometimes used to denote a system in which no software components of any form are included in the guest OS other than those that would be found in a non-virtualized computer; thus, the guest OS 124 could be an off-the-shelf, commercially available OS with no components included specifically to support use in a virtualized environment.

In contrast, another term, which has yet to achieve a universally accepted definition, is that of “para-virtualization.” As the term implies, a “para-virtualized” system is not “fully” virtualized, but rather the guest is configured in some way to provide certain features that facilitate virtualization. For example, the guest in some para-virtualized systems is designed to avoid hard-to-virtualize operations and configurations, such as by avoiding certain privileged instructions, certain memory address ranges, etc. As another example, some para-virtualized systems include an interface within the guest that enables explicit calls to other components of the virtualization software.

For some, the term para-virtualization implies that the guest OS (in particular, its kernel) is specifically designed to support such an interface. Others define the term para-virtualization more broadly to include any guest OS with any code that is specifically intended to provide information directly to any other component of the virtualization software. According to this view, loading a module such as a driver designed to communicate with other virtualization components renders the system para-virtualized, even if the guest OS as such is an off-the-shelf, commercially available OS not specifically designed to support a virtualized computer system. Unless otherwise indicated or apparent, embodiments are not restricted to use in systems with any particular “degree” of virtualization and are not to be limited to any particular notion of full or partial (“para-”) virtualization.

In addition to the sometimes fuzzy distinction between full and partial (para-) virtualization, two arrangements of intermediate system-level software layer(s) are in general use—a “hosted” configuration and a non-hosted configuration. In a hosted virtualized computer system, an existing, general-purpose operating system forms a “host” OS that is used to perform certain input/output (I/O) operations, alongside and sometimes at the request of the VMM.

The system of FIG. 1 is an example of a non-hosted configuration in which VMMs are deployed on top of a software layer—hypervisor kernel 132—constructed specifically to provide support for the virtual machines. Kernel 132 also may handle any other applications running on it that can be separately scheduled, as well as a console operating system that, in some architectures, is used to boot the system and facilitate certain user interactions with the virtualization software.

PCI SR-IOV

Many modern computing devices employ input/output (IO) adapters and buses that utilize some version or implementation of the Peripheral Component Interconnect (PCI) standard, which specifies a computer bus for attaching peripheral devices to a computer motherboard. PCI Express (PCIe) is an implementation of the PCI computer bus that uses existing PCI programming concepts, but bases the computer bus on a different and much faster serial physical-layer communications protocol. In addition to the PCI and PCIe specifications, the PCI-SIG has defined input/output virtualization (IOV) standards for defining how to design an IO adapter that can be shared by several virtual machines.

The term “function” is used in the PCI context to signify a device with access controlled by a PCI bus. A PCI function is identified within a single PCI root complex by its PCI or PCIe bus, device, and slot identifiers. A PCI function includes a configuration space, which includes both device dependent and device independent regions used by host software to support device relocation on the PCI bus, flexible device-to-interrupt binding, device identification, and device configuration. A function also includes memory space, which is identified by Base Address Registers in configuration space and provides a memory mapped I/O interface for I/O initiated from the host to the device. A PCIe function also includes message space, which is identified by Message Signaled Interrupt (MSI) and Message Signaled Interrupt-Extended (MSI-X) capabilities in configuration space and provides either or both of MSI and MSI-X message based interrupt generation. Many network (e.g., Ethernet) and storage (e.g., disk) adapters are implemented as PCI or PCIe compliant adapters and are recognized by a machine's PCI sub-system as a single PCI function. Multi-port PCI or PCIe adapters simply appear to a host PCI sub-system as multiple PCI functions.
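By way of illustration only, the following minimal C sketch decodes a Base Address Register value according to the standard PCI register layout (bit 0 selects I/O versus memory space; for memory BARs, bits 2:1 give the address type and bit 3 the prefetchable flag). The example BAR values passed in main are hypothetical and not taken from any adapter described herein.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a PCI Base Address Register value: bit 0 distinguishes
     * memory space (0) from I/O space (1); for memory BARs, bits 2:1
     * give the address type and bit 3 the prefetchable flag. */
    static void decode_bar(uint32_t bar)
    {
        if (bar & 0x1) {
            printf("I/O space BAR, base 0x%08x\n", bar & ~0x3u);
        } else {
            printf("memory space BAR, base 0x%08x, %s, %sprefetchable\n",
                   bar & ~0xFu,
                   ((bar >> 1) & 0x3) == 0x2 ? "64-bit" : "32-bit",
                   (bar & 0x8) ? "" : "non-");
        }
    }

    int main(void)
    {
        decode_bar(0xFEB00000u); /* hypothetical memory BAR value */
        decode_bar(0x0000E001u); /* hypothetical I/O BAR value */
        return 0;
    }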

FIG. 2 is an illustrative drawing of a virtualized system 200 including SR-IOV virtualization. Techniques specified in the PCI-SIG SR-IOV specification can be used to reduce the CPU impact of high throughput workloads by bypassing the virtualization intermediary. The term ‘single root’ refers to a single root complex as contrasted with a multiple root complex. In a PCI Express system, a root complex device couples the processor and memory subsystem to a PCI Express switch fabric comprised of one or more switch devices. The root complex generates transaction requests on behalf of the processor, which is interconnected through a local bus.

The illustrative system includes VMs 202-206, each independently running a separate (and possibly different) guest operating system. A virtualization intermediary layer 218 runs between the virtual machines 202-206 and a host machine 216. Device driver 208 of VM 202 and device driver 210 of VM 204 each drive a physical function (PF) 222, with intervention by the virtualization intermediary 218. Device driver 212 of VM 206 drives the virtual function (VF) 228, without intervention by the virtualization intermediary 218. The device driver 212 communicates with IO MMU logic 224 disposed on the host machine 216 in the course of accessing data with mass storage (not shown). A device manager 220 within virtualization intermediary 218 manages the allocation and de-allocation of VFs for the IOV adapter 214. The IOV adapter 214 provides a memory-mapped input/output interface for IO and provides an interface for controlling VFs.

A typical IOV adapter includes processor, memory and network interface resources (not shown) to implement the PF and one or more virtual functions (VFs). A PF is a PCIe function that supports the SR-IOV capabilities defined in the PCI SR-IOV specification. A PF is used to control the physical services of the device and to manage individual VFs.

A VF is a PCIe function which is associated with a particular physical function and shares physical PCI adapter resources (e.g., ports, memory) with that physical function and other virtual functions located on the same physical adapter. A virtual function has its own PCI configuration space, memory space, and message space separate from other physical or virtual functions on that same adapter. A physical function, such as PF 222 in this example, that is associated with a virtual function 228 is responsible for allocating, resetting, and de-allocating that virtual function and the PCI resources required by that virtual function. In general, a VF can either be accessed via a virtualization intermediary or bypass the virtualization intermediary to be directly accessed by a guest OS. In the example system 200, VMs 202, 204 respectively access PF 222 via the virtualization intermediary 218, and VM 206 accesses VF 228 directly, i.e. without the virtualization intermediary 218. Thus, a VF can, without run-time intervention by a virtualization intermediary, directly be a sink for I/O and memory operations from a VM, and be a source of Direct Memory Access (DMA), completion, and interrupt operations to a VM.

SCSI Command Protocol

The InterNational Committee for Information Technology Standards (INCITS) T10 Technical Committee has adopted a layered approach that divides the Small Computer System Interface (SCSI) into multiple layers of standards. The lowest layer refers to physical interfaces, sometimes referred to as physical transports. The next layer up pertains to transport protocols usually directly associated with one physical transport standard. The top layer consists of command sets associated with specific devices such as disk drives or tape drives, for example. See, J. Lohmeyer, SCSI Standards Architecture, Business Briefing: Data Management & Storage Technology 2003. A result of this layered approach to the SCSI standard is that there are over 30 SCSI standards. In general, only a few of these standards apply to a given product. As used herein, the term ‘SCSI’ signifies compliance with one or more of these SCSI standards.

A SCSI command is a request describing a unit of work to be performed by a device server. A SCSI command descriptor block (CDB) is a structure used to communicate commands from an application client to a device server. The SCSI command set assumes an underlying request-response protocol. The fundamental properties of the request-response protocol are defined in SCSI Architecture Model (SAM)-3, Revision 14. Action on a SCSI command is not deemed completed until a response is received. The response ordinarily includes a status that indicates the final disposition of the command. See, SCSI Primary Commands-3 (SPC-3), Revision 23, Section 4.2, The request-response model, May 4, 2005, American National Standards for Information Systems—InterNational Committee for Information Technology Standards (hereinafter “SPC-3, Revision 23”).
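As a concrete illustration of a CDB, the following C sketch assembles a 10-byte READ(10) command per the SCSI block command standards (operation code 0x28, a 32-bit big-endian logical block address, and a 16-bit big-endian transfer length in blocks); the function name and calling convention are illustrative assumptions.

    #include <stdint.h>
    #include <string.h>

    /* Assemble a READ(10) CDB: opcode 0x28, big-endian 32-bit LBA in
     * bytes 2-5, big-endian 16-bit transfer length (blocks) in bytes 7-8. */
    void build_read10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
    {
        memset(cdb, 0, 10);
        cdb[0] = 0x28;                  /* READ(10) operation code */
        cdb[2] = (uint8_t)(lba >> 24);
        cdb[3] = (uint8_t)(lba >> 16);
        cdb[4] = (uint8_t)(lba >> 8);
        cdb[5] = (uint8_t)lba;
        cdb[7] = (uint8_t)(nblocks >> 8);
        cdb[8] = (uint8_t)nblocks;
    }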

A SCSI device is a device that contains one or more SCSI ports that are connected to a service delivery subsystem and supports a SCSI application protocol. An application client is an object that is the source of SCSI commands. A SCSI initiator device is a SCSI device that contains application clients and SCSI initiator ports that originate device service and task management requests to be processed by a SCSI target device and receive device service and task management responses from SCSI target devices. A SCSI initiator port is a SCSI initiator device object that acts as the connection between application clients and the service delivery subsystem through which requests and responses are routed. A SCSI target device is a SCSI device containing logical units and SCSI target ports that receive device service and task management requests for processing and send device service and task management responses to SCSI initiator devices. A SCSI target port is a SCSI target device object that acts as the connection between device servers and task managers and the service delivery subsystem through which requests and responses are routed. A logical unit is an externally addressable entity within a SCSI target device that implements a SCSI device model and contains a device server. See, SPC-3, Section 3.1 Definitions.

For the purpose of the following description, it is assumed that the IO storage adapter described herein employs a port based SCSI transport protocol, such as Fibre Channel, iSCSI or SAS, to transfer data between a host system IO bus and SCSI storage. In accordance with the iSCSI transport protocol, for example, a SCSI initiator is responsible for packaging a SCSI CDB, perhaps with the aid of a machine's operating system, and sending the CDB over an IP network. An iSCSI target receives the CDB and sends it to an iSCSI logical unit, which may be a disk, CD-ROM, tape drive, printer, scanner or any type of device, managed by a SCSI target. The SCSI target sends back a response to the CDB that includes a status that indicates the final disposition of the command.

A SCSI target may manage numerous SCSI logical units. In some embodiments, a SCSI target identifier in combination with a SCSI LUN (logical unit number) and a Logical Block Address (LBA) constitutes a storage address. A separate parameter indicates the size of the storage region located at the specified storage address in terms of the number of contiguous blocks associated with the address. A SCSI LUN serves as an instance identifier for a SCSI logical unit that uniquely identifies a SCSI logical unit within the scope of a given SCSI target at a given time. A SCSI LBA is the value used to reference a logical block within a SCSI logical unit.
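One plausible way to represent such a storage address and region in C is sketched below; the field names and widths are assumptions for illustration, not taken from the SCSI standards or any adapter described herein.

    #include <stdint.h>

    /* A storage address as described above: (target, LUN, LBA), with a
     * separate block count giving the size of the region at that address. */
    struct scsi_storage_address {
        uint64_t target_id;   /* SCSI target identifier */
        uint64_t lun;         /* SCSI logical unit number within the target */
        uint64_t lba;         /* SCSI logical block address within the unit */
    };

    struct scsi_region {
        struct scsi_storage_address addr;
        uint32_t nblocks;     /* number of contiguous blocks at addr */
    };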

Multiple SCSI commands can be issued concurrently, i.e. a host system can issue a new SCSI command before a prior SCSI command completes. Moreover, as will be explained more fully below, one “virtual” SCSI command to access a (virtual SCSI target, virtual SCSI logical unit) combination can result in multiple “physical” SCSI commands to one or more physical SCSI logical units of one or more physical SCSI targets, in which case a VF servicing the one virtual SCSI command must wait until all of the multiple SCSI commands to the multiple (physical SCSI target, physical SCSI logical unit) combinations have completed before ‘completing’ the virtual SCSI command.
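The completion rule just described can be sketched as a simple counter, as in the C below; the structure and function names are assumptions, and a real adapter would also need atomic updates since children may complete concurrently.

    #include <stdbool.h>
    #include <stdint.h>

    /* A virtual SCSI command that fans out into N physical child commands
     * completes only after all N children have completed. */
    struct virtual_cmd {
        uint32_t children_outstanding; /* physical child commands in flight */
        bool     failed;               /* latched if any child fails */
    };

    /* Called once per completed child; returns true when the virtual
     * command itself may be completed back to the VM. */
    bool child_completed(struct virtual_cmd *vc, bool child_ok)
    {
        if (!child_ok)
            vc->failed = true;
        return --vc->children_outstanding == 0;
    }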

IOV with Virtual Storage on Fast Access Data Path

FIG. 3 is an illustrative drawing of a system 300 that includes a host machine 302 that hosts a virtual machine 304 and that is coupled to a storage adapter 306 that adapts IO communications over a PCI bus protocol of the host machine 302 to SCSI storage access protocols used to access persistent physical storage 308. The system 300 of FIG. 3 is an implementation that in general possesses much the same general type of configuration and component structures explained with reference to FIG. 1, except that VM 304 is configured for direct access to physical storage via the IOV storage adapter 306 in accordance with the PCI SR-IOV specification. However, details are omitted from FIG. 3 so as not to obscure IOV features.

In some embodiments the PCI bus protocol is compliant with both the PCI (Peripheral Component Interconnect) Express specification and the PCIe SR-IOV extension specification, and SCSI commands are used with one or more SCSI transport protocols such as iSCSI, SAS or Fibre Channel to directly communicate IO access requests (Read/Write) with persistent physical storage 308 such as SAN storage, for example. More particularly, a storage adapter 306 allows a virtual machine 304 to access physical storage 308 via IOV direct access for certain SCSI Read/Write CDBs and allows access via a virtualization intermediary 310 for other SCSI CDBs. Accordingly, certain frequently occurring Read and Write SCSI commands are directed over a fast IOV data path coupling between the virtual machine 304 and a virtual function 316 substantially without involvement of the virtualization intermediary 310. However, certain error conditions, such as recoverable error conditions, are directed to the virtualization intermediary 310 to resolve more complex conditions that require more elaborate processing.

The storage adapter 306 includes adapter resources 339 such as processor and memory resources and network protocol translation and interface resources, which will be readily understood by persons skilled in the art, to implement a physical function (PF) 314 and the virtual function (VF) 316. In the illustrated embodiment, the VF 316 is associated with virtual machine 304. A PF driver 318 communicates information between the PF 314 and the virtualization intermediary 310. A VF driver 321 communicates information with both the VF 316 and a hybrid storage adapter (HSA) 320 instantiated within the virtual machine 304. Although only one VM 304 and one corresponding VF 316 are shown and described herein, it will be appreciated that the host system 302 may host multiple VMs and the adapter 306 may implement multiple corresponding VFs, and the description herein would apply to each such combination of VM and VF. Multiple VFs (only one shown) may be instantiated within the adapter 306, and each respective virtual machine (only one shown) may be associated with a different respective VF to create respective IOV data paths for certain frequently occurring Read and Write SCSI commands.

The HSA 320 ‘appears’ to be a physical PCI device (i.e. a storage adapter) from the perspective of the virtual machine 304. The HSA 320 acts as the virtual machine's interface to the physical storage world, i.e. to the physical storage adapter 306. The hybrid storage adapter comprises an emulated PCI storage adapter within the VM 304, which encapsulates a PCI SR-IOV virtual function of an SR-IOV compliant physical storage adapter presented by the virtualization intermediary 310 within a protected memory space of the virtual machine. A PCI configuration space 309 of the virtual function 316 is copied to the HSA's PCI configuration space, so as to provide a memory mapped interface to the first HSA PCI memory space 311 that supports direct access to physical storage 308. The HSA 320, through the first PCI memory space, provides a direct access path by which the guest OS 307 of the virtual machine 304 can issue IO directly to the physical adapter 306 without intervention by the virtualization intermediary 310. In addition, the HSA's PCI configuration space maps to the second HSA PCI memory mapped interface 313 that supports fully emulated access to physical storage 308 via the physical adapter 306 through the virtualization intermediary 310. Although the HSA 320 is shown resident within a protected memory space of the virtual machine 304, it will be appreciated that it could instead reside within the virtualization intermediary 310. The HSA 320 is referred to as a ‘hybrid’ herein because it has two memory mapped interfaces 311 and 313.

The VF driver 321 is savvy as to the hybrid nature of the HSA 320, and as such is a ‘para-virtual’ device driver. The VF driver 321 directs certain SCSI IO operations to the first HSA PCI memory space 311 for direct access to physical storage 308 via the VF. The VF driver 321 directs other SCSI operations to the second HSA PCI memory space 313 for fully emulated access to physical storage 308.

In some embodiments, the virtualization intermediary 310 is implemented as the ‘ESX’ hypervisor produced by VMware, Inc. having a place of business in Palo Alto, Calif. The ESX hypervisor serves as a virtualization intermediary having both VMM and hypervisor functionality. Each VM (e.g. virtual machine 304) runs on top of ESX. In an ESX environment, a portion of each VM comprises a VMM. That is, VMMs are embedded in the VM's address space, albeit in a protected region. In some embodiments, the hybrid storage adapter 320 also resides in a protected memory space of the VM, and more particularly, runs within the execution context of the VMM that is embedded in that VM's memory space. If a given VM has multiple VCPUs, then each VCPU has an associated VMM. In an ESX environment, the VMM/hypervisor virtualization intermediary serves as the primary memory management component to manage multiple simultaneously running VMs.

A SCSI target emulation module 322 within the virtualization intermediary 310 maps virtual SCSI logical units (i.e. virtual disks) to physical SCSI targets and physical SCSI logical units located in storage 308. For each VM (only one shown) that has an associated VF (only one shown), mapping metadata provided by the SCSI target emulation module 322 is communicated to the associated VF 316 via the PF driver 318, PF 314 and interconnect circuitry 324 within the adapter 306. The VF 316 is thereby provided with the mapping information used to provide direct access to physical SCSI targets and physical SCSI logical units within physical storage 308 that have been allocated to the one or more virtual disks allocated to such virtual machine 304. Note that the interconnect circuitry 324 may be implemented in hardware or firmware, for example.

Provisioning and Instantiation of Virtualized Compute Resources for IOV

FIGS. 4A-4D are illustrative drawings that show a process to provision and instantiate the virtualized computer resources of the system 300 of FIG. 3. Dashed lines are used in these drawings to represent components that are in the process of being instantiated. Solid lines are used to represent components that already have been instantiated. Arrows represent flow of control or information. Certain components in FIGS. 4A-4D shown above the host 302 represent a virtual machine 304 and the virtualization intermediary 310 configuring the host machine 302 according to machine readable program code stored in a machine readable storage device to perform specified functions of the components. Each of the drawings of FIGS. 4A-4D represents a different stage of the provisioning and instantiating of the virtualized computer resources of the system 300 of FIG. 3.

Referring to FIG. 4A, in the course of instantiating the virtualization intermediary 310, a known PCI manager routine 355 scans the PCI bus (not shown) of the host machine 302 and discovers the physical SR IOV hardware bus adapter (SR IOV adapter) 306 as a valid PCI device and invokes all registered PCI compliant drivers until one driver claims the SR IOV adapter 306. In this case, PF driver 318 claims the SR IOV adapter 306 and discovers a PCIe function of the discovered device to be a physical function (PF) 314, which is indicated by SR-IOV capabilities specified in the PF function's PCI configuration space 315.

After claiming the storage adapter's PF, the PF driver 318 obtains attribute information from the physical adapter 306 via the PF 314, such as the number of SCSI initiator ports and the speed of these ports. The PF driver 318 also discovers physical SCSI target ports and attempts to establish connections, indicated by arrow 357, through the PF 314 with these ports using SCSI transport protocol dependent mechanisms, such as Fibre Channel, iSCSI and SAS, for example. Through these connections, the PF driver 318 learns of possible connections to physical SCSI targets, represented by arrows 359. In this illustrative example, the PF driver 318 learns of physical targets P_(T0) to P_(TN) and P_(TN) to P_(TM) through these connections 359. The PF driver 318 passes identifying information concerning the discovered physical SCSI targets up to storage stack 328 and higher level software in the virtualization intermediary 310, as represented by arrow 361.

Referring to FIG. 4B, the storage stack 328 probes the SCSI bus via the PF driver 318 and PF 314 for paths associated with physical storage 308 by sending SCSI commands to all possible SCSI logical units on all discovered physical SCSI targets. In this example, the storage stack learns of physical SCSI logical units P_(T0)(LU0, LU1, . . . ), P_(TN)(LU0, LU1, . . . ) to P_(TM)(LU0, LU1, . . . ). In this manner, the storage stack 328 learns of redundant physical SCSI paths to reach all discovered physical SCSI logical units. A unique SCSI path can be defined in terms of a unique combination, a three-tuple, comprising (SCSI initiator identifier, SCSI target identifier, SCSI logical unit number). The discovery of redundant SCSI paths is used for multipath management within the storage stack component 328. During runtime, one role of the storage stack 328 is to handle physical SCSI transport connection errors; that is, to perform SCSI multipathing path failover.
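As a sketch of this path bookkeeping, the three-tuple and a simple failover selection might look as follows in C; all names and the linear search are illustrative assumptions.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A unique SCSI path: (initiator, target, LUN). */
    struct scsi_path {
        uint64_t initiator_id;
        uint64_t target_id;
        uint64_t lun;
        bool     alive;   /* cleared on a transport connection error */
    };

    /* Multipath failover: return the first usable redundant path to a
     * logical unit, or NULL if every path is down. */
    struct scsi_path *select_path(struct scsi_path *paths, size_t npaths)
    {
        for (size_t i = 0; i < npaths; i++)
            if (paths[i].alive)
                return &paths[i];
        return NULL;
    }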

Now that the virtualization intermediary 310 (i.e. storage stack 328) has learned of the physical storage resources, virtual storage resources are created, allocated and mapped to the physical storage resources. More particularly, a VM provisioning utility 350, which may be a part of the virtualization intermediary 310, creates virtual disks. In the illustrated example, virtual disks V_(D1), V_(D2), V_(D3), . . . are created. As illustrated in FIG. 4B, the VM provisioning utility maps the newly created virtual disks to discovered physical logical units. In the illustrated example, V_(D1) is mapped to P_(T0)(LU0); V_(D2) is mapped to P_(T0)(LU1); and V_(Dn) is mapped to P_(TM)(LU0), etc. The virtual disk creation process typically also involves the creation of the file system 328-1 and a logical volume manager (not shown) to track allocated regions of physical storage of multiple physical SCSI logical units across multiple virtual disk files on that file system. There need not be a one-to-one relationship between physical SCSI logical units and virtual disks. For example, the physical-to-virtual mapping may be as small as an addressable SCSI logical block. Generally, the SCSI logical block size is configurable. See, SCSI Block Commands-3 (SBC-3), Revision 15, Section 4.4 Logical Blocks, May 13, 2005, American National Standards for Information Systems—InterNational Committee for Information Technology Standards (hereinafter “SBC-3, Revision 15”).

More particularly, the VM provisioning utility 350 creates first mapping metadata 361 to indicate the correlation between virtual disks and corresponding regions of physical SCSI logical units. Each region of the physical storage is identified by both a unique storage address and a size. Each respective virtual disk may correspond to one or more such physical storage regions. The first mapping metadata 361 provides a mapping of each individual virtual disk to the one or more physical regions allocated to that virtual SCSI logical unit. The first mapping metadata 361 is persistently stored since one or more virtual machines to be provisioned with such virtual disks may not be instantiated until some time in the future.

Referring now to FIG. 4C, in the course of instantiation of virtual compute resources for virtual machine 304, the virtualization intermediary 310 creates virtual hardware resources for the virtual machine 304 such as one or more virtual CPUs, virtualized memory, a virtualized PCI bus and one or more virtual PCI devices including a hybrid storage adapter (HSA) 320.

As part of the instantiation of the HSA 320, a VF 316, a virtual port (not shown), and an MSI or MSI-X interrupt vector (not shown) are allocated/reserved. The VF 316 is “bound” to both the virtual port and the interrupt vector. More specifically, the virtualization intermediary 310 creates (sets up) the virtual port, the PF driver 318 is made aware of the association, and the PF driver passes the binding of the virtual port and MSI/MSI-X interrupt vector to the VF 316 via the PF 314 and the interconnect circuitry.

In order to utilize existing SAN and SCSI target access control mechanisms (e.g., FC zoning and SCSI target based LUN masking) for authorization of I/O from different VMs each utilizing different VFs on the same physical storage adapter, I/O sent by a VF directly accessed by a virtual machine is associated with a virtual port assigned to that VF as opposed to the single physical port of the VF's physical adapter. To that end, during the resource provisioning phase, the above-mentioned virtual port is allocated and persistently associated with the VF 316.

A virtual port is assigned for each VF provisioned to a VM. In the realm of storage virtualization, there exists a notion of a virtual port that exists in software. A virtual port has a unique port address just as a physical port does. The virtualization intermediary 310 performs a series of logins to the fabric on behalf of the VM associated with the VF to which the virtual port is assigned, to authenticate the virtual port and to receive a transport address. As a result of the login operation, the virtual port is authenticated with the fabric and has an established transport connection to a physical SCSI target. Thus, the VF associated with the virtual port serves as an initiator in a SCSI path triplet (virtual SCSI initiator, physical SCSI target, physical LUN).

During both resource provisioning and the runtime life of a virtual machine, physical SCSI storage utilized for virtual disks to be directly accessed by a given VF should be ‘visible’ from that VF's virtual port. The provisioning utility 350 communicates to the storage stack 328 the identity of the virtual port that is bound to the virtual function 316. A virtual port identifier is used by constituents of the virtualization intermediary 310 to associate physical storage resources with the VF 316 corresponding to such virtual port. The storage stack tracks SCSI paths separately for virtual SCSI initiator ports versus physical SCSI initiator ports. Both the provisioning of physical storage for use by the VF 316 and access to this storage at runtime by the VF must be via that virtual port.

The existence and identity of the virtual port is communicated from the PF to a VF, and the VF ensures that I/O that the VF 316 sends to the physical storage (e.g., SAN) 308 utilizes the SCSI transport address of the virtual port and not the adapter's physical port SCSI transport address.

During ‘steady state’ operation, described more fully below, the virtual port is used to exchange SCSI read/write communications via a SCSI transport level protocol (e.g. SAS, Fibre Channel or iSCSI) between the VF and regions of physical storage identified in terms of (physical target, physical logical unit). However, the virtual SCSI initiator within the HSA 320 and the virtual SCSI target within the VF 316 employ the SCSI Parallel Interface (SPI) transport protocol, which is not port based, in their communications with each other. In the SPI protocol, the virtual SCSI initiator identifier and the virtual SCSI target identifier serve as connections for the SPI. Hence, the VF driver 321 has knowledge of the virtual SCSI target identifier associated with the VF 316, and the virtual LUN information does not serve as SPI connection information.

Also, in the course of instantiating (i.e. creating) a new virtual machine 304, the provisioning utility 350 allocates one or more previously created virtual disks (i.e. virtual SCSI logical units) to such new virtual machine 304. Many, perhaps hundreds, of virtual disks can be allocated to a given virtual machine 304. Each virtual disk that is allocated to the given virtual machine 304 is assigned a unique virtual address, represented by second mapping metadata 363, that comprises a two-tuple (virtual SCSI target, virtual SCSI LUN). In the example in FIG. 4C, virtual disk V_(D1) is allocated two-tuple1; V_(D2) is allocated two-tuple2 and V_(D3) is allocated two-tuple3.

The virtual address information representing the second mapping metadata 363 for each virtual disk is arrived at as follows. The virtualization intermediary 310 creates HSA 320 in the course of instantiating the virtual machine 304. The HSA 320 requests that a virtual SCSI target emulation module 322 within the virtualization intermediary 310 perform the following tasks related to the instantiation of compute resources of a virtual SCSI target emulation on the VF 316. The SCSI target emulation module 322 instantiates one or more virtual SCSI targets. For each virtual SCSI logical unit emulated by each virtual SCSI target, the virtual SCSI target emulation module 322 initializes SCSI logical unit attributes such as capacity/size, access control restrictions (e.g., read only), a virtual SCSI logical unit number (LUN) with which to associate the virtual SCSI logical unit with the SCSI target, and a logical extent map.

The logical extent map contains a series of logical extents, one logical extent for every portion of a virtual SCSI logical unit which is backed by a contiguous region of physical storage. Each logical extent consists of a SCSI logical block address within the virtual SCSI logical unit, a physical extent identifying the storage address of the exact location of the contiguous region of physical storage, and a length which indicates the size of the contiguous region of physical storage.
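The logical extent map lends itself to a direct representation, sketched below in C with a lookup that translates a virtual block address to its backing extent; the types, names, and linear search are illustrative assumptions (an implementation would more likely keep the map sorted and binary-search it).

    #include <stddef.h>
    #include <stdint.h>

    /* Physical extent: the storage address of a contiguous region. */
    struct physical_extent {
        uint64_t target_id;   /* physical SCSI target */
        uint64_t lun;         /* physical SCSI logical unit */
        uint64_t start_lba;   /* first physical block of the region */
    };

    /* Logical extent: a run of virtual blocks backed by one physical region. */
    struct logical_extent {
        uint64_t vlba;                /* first virtual block covered */
        uint64_t nblocks;             /* length of the contiguous region */
        struct physical_extent phys;  /* where that run resides */
    };

    /* Find the extent covering virtual block vlba, or NULL if the block
     * is unallocated (the error case handed to the virtualization
     * intermediary, as described elsewhere in this document). */
    const struct logical_extent *
    lookup_extent(const struct logical_extent *map, size_t n, uint64_t vlba)
    {
        for (size_t i = 0; i < n; i++)
            if (vlba >= map[i].vlba && vlba < map[i].vlba + map[i].nblocks)
                return &map[i];
        return NULL;
    }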

During the instantiation phase of the HSA 320, the HSA requests the creation of a virtual SCSI target within the virtual SCSI target emulation module 322 of the virtualization intermediary 310. The virtualization intermediary initiates the creation of a virtual SCSI target within the VF 316. As indicated by arrow 370, the HSA 320 requests that the virtual SCSI target emulation module 322 forward a virtual target emulation mapping 317, comprising the first and second mapping metadata 361 and 363, to the VF 316 corresponding to the VM 304 that is being instantiated. It will be appreciated that the second mapping metadata 363 indicates the virtual target and virtual LUN information associated with each virtual disk allocated to VM 304. The module 322 sends the mapping 317 to the PF driver 318, as indicated by arrow 372. The PF driver 318 formats the mapping 317 as control I/O control blocks and sends them to the PF as indicated by arrow 374. The PF 314 forwards the mapping 317 to the VF 316 via interconnect circuitry of the physical adapter 306. During runtime, as explained in sections below, the VF 316 uses mapping 317 to map virtual SCSI logical units (virtual disks) allocated to the virtual machine 304 to possibly multiple disparate regions across possibly multiple physical SCSI logical units, each possibly accessible via different physical SCSI targets.

Referring to FIG. 4D, the instantiation of the virtual machine 304 also includes simulation of an actual “power on” event on the virtual hardware. In reaction to this event, the virtual machine's BIOS enters a boot sequence and starts up the guest operating system 307 of the given virtual machine 304. As indicated by arrow 380, the guest OS 307 probes its virtualized PCI bus and matches virtualized PCI devices with guest drivers registered as capable of managing these virtualized PCI devices. The VF driver 321 claims the hybrid storage adapter (HSA) 320.

The virtual SCSI initiator, which may be associated with one or more virtual SCSI targets, is located within the HSA 320. Virtual SCSI logical units (i.e. virtual disks) allocated to a given virtual machine 304 are identified using one or more different virtual SCSI LUNs associated with one or more such virtual SCSI targets.

As part of the initialization of its claimed device, that is the HSA 320, the VF driver 321 retrieves attributes from the HSA 320. The VF driver finds out about the first and second HSA PCI memory spaces 311, 313 via the HSA's emulated PCI configuration space 309. The VF driver 321 issues control messages to the HSA 320 to retrieve HSA specific attributes such as its virtual SCSI initiator address and all virtual SCSI targets accessible via that SCSI initiator. These messages are sent over the second HSA PCI memory space 313 as indicated by arrow 382, where they are trapped and emulated by the HSA 320 and forwarded to the virtualization intermediary 310 as indicated by arrow 384. The SCSI target emulation module 322 within the virtualization intermediary 310 informs the VF driver 321 of the virtual SCSI targets associated with the VF 316 allocated to VM 304, per the mapping 317, as indicated by arrow 386. It will be appreciated that by using the virtual port bound to the VF 316, the storage stack 328 can access the same set of SCSI paths as can the VF 316.

As indicated by arrow 388, the VF driver 321 passes information concerning the existence of the discovered virtual SCSI targets to the higher level software within the guest operating system 307. A storage stack 379 of the guest operating system 307 probes each such SCSI target for its virtual SCSI LUNs by sending SCSI commands via the second HSA memory space 313 to all possible SCSI logical units on all such virtual SCSI targets. The virtual SCSI LUNs correspond to the one or more virtual logical units (i.e. virtual disks) allocated to the virtual machine 304. The guest operating system of the virtual machine 304, therefore, has knowledge of the (virtual SCSI target, virtual SCSI LUN) address information that identifies the virtual SCSI logical units (i.e. virtual disks) allocated to the given new virtual machine 304.

Persons skilled in the art will appreciate that the use of the first mapping metadata 361 to map virtual SCSI logical units (i.e. virtual disks) to portions of physical SCSI logical units and the use of the second mapping metadata 363 to map virtual SCSI address two-tuple information (i.e. virtual SCSI target, virtual SCSI LUN) to virtual disks allocated to a given virtual machine 304 facilitates dynamic changes to the first mapping 361 to accommodate changing storage requirements of a virtual disk identified by a given virtual SCSI address two-tuple.

IOV Access Operations

FIGS. 5-6 are illustrative transition diagrams that represent the operation of the system 300 of FIG. 3 and FIGS. 4A-4D. FIG. 5 is an illustrative transition diagram that illustrates process flow during a successful IOV Read/Write operation. FIG. 6 is an illustrative transition diagram that illustrates process flow during an IOV read operation in which an error is identified by a virtual function VF 316. Transitions shown within FIGS. 5-6 that are identical are labeled with identical reference numerals. Errors, both recoverable and un-recoverable, may be discovered by the VF 316 or by a physical SCSI target in the course of an IOV operation.

The process of FIGS. 5-6 is performed using a virtual machine 304 and virtualization intermediary 310 and a VF 316 that interoperate with the host machine 302 and the adapter 306. Certain components in FIGS. 5-6 that correspond to the virtual machine 304 and the virtualization intermediary 310 and the VF 316 act to configure the host machine 302 or the physical adapter 306 according to machine readable program code stored in a machine readable storage device to perform specified functions of the components. The transitions of FIGS. 5-6 represent different particular machines created in the course of a storage access operation.

In the following sections, it will be understood that the VF driver 321 acts as a “para-virtual” device that knows which commands to direct to the first HSA PCI memory space 311 and knows which commands to direct to the second HSA PCI memory space 313. For example, in the course of instantiation of the given virtual machine 304, which is described above with reference to FIGS. 4A-4C, the VF driver 321 directs SCSI control plane commands (i.e., non-Read/Write SCSI commands) to the second HSA PCI memory space 313. During runtime operation, the VF driver 321 directs only certain SCSI Read/Write commands to the first HSA memory space 311 and directs all other SCSI commands to the second HSA memory space 313.

During runtime, the HSA 320 in conjunction with the VF driver 321 directs certain SCSI commands, i.e. certain Read/Write SCSI commands, via the first HSA PCI memory space to the VF 316 for direct access to the host machine's storage adapter. In particular, SCSI Read commands and SCSI Write commands with Command Data Blocks (CDBs), forms of IO access requests, having byte sizes of 6, 10, 12 or 16 are directed to the VF 316 via the first HSA PCI memory space 311 for direct access to the storage adapter. Other SCSI commands, such as those for configuration or discovery, etc., are directed via the second HSA PCI memory space 313 to the virtualization intermediary 310, which performs emulated processing. As a result, the SCSI target emulation located on the VF 316 can be simplified since it does not have to service non-read/write commands.
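Assuming the standard SCSI operation codes for the 6-, 10-, 12- and 16-byte read and write CDB forms, the routing decision reduces to an opcode test such as the C sketch below; the function name is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    /* Route only READ/WRITE commands of the 6-, 10-, 12- and 16-byte CDB
     * forms to the direct (first HSA memory space) path; all other
     * commands take the emulated (second memory space) path. */
    bool goes_direct(uint8_t opcode)
    {
        switch (opcode) {
        case 0x08: case 0x0A:   /* READ(6),  WRITE(6)  */
        case 0x28: case 0x2A:   /* READ(10), WRITE(10) */
        case 0xA8: case 0xAA:   /* READ(12), WRITE(12) */
        case 0x88: case 0x8A:   /* READ(16), WRITE(16) */
            return true;
        default:
            return false;       /* configuration, discovery, etc. */
        }
    }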

Furthermore, during normal runtime operation, it is the guest operating system 307 that determines which logical unit (i.e. virtual disk) is implicated by a given physical storage access request. It will be appreciated that during runtime, the VF driver 321 uses a unique combination comprising a virtual SCSI initiator identifier and a virtual SCSI target identifier to form SPI connections with the VF 316. A virtual port bound to the VF 316 serves as a transport address for the VF 316. The virtual port in combination with a physical port of a SCSI target and a physical SCSI LUN of the physical SCSI target acts as a path between the VF 316 and physical storage 308.

The VF driver 321 must service SCSI task management operations in a manner which is savvy towards the existence of both HSA memory spaces. The VF driver must track to which memory space a SCSI command was dispatched in order to be able to abort this command if requested at a later time. Furthermore, SCSI bus, target, and LUN reset task management operations must involve resetting both memory spaces successfully before the SCSI reset operation may be considered complete.
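The bookkeeping this implies might be sketched as follows in C: record which memory space each command was dispatched through, keyed by its serial number, so a later abort can be routed the same way. The table, names, and stub abort routines are all illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>

    enum dispatch_path { PATH_DIRECT, PATH_EMULATED };

    struct cmd_record {
        uint64_t serial;          /* uniquely identifies the command */
        enum dispatch_path path;  /* HSA memory space it was dispatched to */
        int in_flight;
    };

    /* Stubs standing in for aborts issued via the first (311) and second
     * (313) HSA PCI memory spaces. */
    static void abort_direct(uint64_t serial)   { (void)serial; }
    static void abort_emulated(uint64_t serial) { (void)serial; }

    /* Route an abort to the memory space the command was dispatched to.
     * Returns 0 on success, -1 if the command is unknown or already done. */
    int abort_command(struct cmd_record *table, size_t n, uint64_t serial)
    {
        for (size_t i = 0; i < n; i++) {
            if (table[i].serial == serial && table[i].in_flight) {
                if (table[i].path == PATH_DIRECT)
                    abort_direct(serial);
                else
                    abort_emulated(serial);
                return 0;
            }
        }
        return -1;
    }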

Successful IOV Access

Referring to both FIG. 3 and to the process 500 of FIG. 5, an application 305 running on the virtual machine 304 issues a Read/Write access request to the guest OS 307, represented by transition 502. The VF driver 321 assembles a virtual SCSI IO request. In some embodiments, the virtual SCSI IO request comprises a Virtual IO Dispatch Control Block containing, amongst other information, an embedded SCSI command data block (CDB) for the read/write operation and addresses identifying both the virtual SCSI target and the virtual SCSI logical unit being accessed. Through its virtual SCSI initiator, the VF driver 321 places the Block on a Virtual IO Block Request queue for access by the VF 316. The VF driver 321 notifies the VF 316 of the addition of the new Dispatch Control Block on the Virtual IO Block Request queue via the first HSA PCI memory space 311 as indicated by transition 504. The SCSI IO request is retrieved by the VF via DMA from the host memory and then is accessed and serviced by the VF 316.

Within the VF 316, an IOV mapping module 326 inspects the Virtual I/O Dispatch Control Block to determine whether the Virtual IO Dispatch Control Block indicates an error condition. If the VF 316 finds no error condition (discussed below), then as indicated by transition 505, it maps the virtual SCSI target identifier information and the virtual SCSI logical unit identifier information in the Virtual I/O Dispatch Control Block to one or more regions of one or more physical SCSI targets and physical SCSI logical units in accordance with a mapping 317 forwarded from the SCSI target emulation module 322, which comprises the first metadata 361 and the second metadata 363. A single virtual SCSI logical unit (i.e. virtual disk) can be mapped to multiple different regions of physical storage.

Servicing a SCSI IO request by the VF 316 involves module 326 parsing its embedded SCSI CDB to retrieve the SCSI LBA and length fields. Next, module 326 accesses the logical extent map of the virtual SCSI logical unit to spawn a set of possibly multiple physical “child” SCSI IO requests across a set of possibly multiple physical SCSI logical units, where each physical SCSI logical unit may be accessible via a different physical SCSI target. Spawning each respective physical SCSI IO request involves allocating a respective Physical I/O Dispatch Control Block and SCSI CDB, and initializing these data structures in accordance with the parameters of the virtual SCSI logical unit's logical extent map and the parameters of the SCSI CDB embedded within the Virtual IO Dispatch Control Block.
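
The extent-map walk that spawns child requests might look as follows. This is an illustrative sketch with hypothetical names and a simplified flat extent map, not the vendor-dependent structure an actual adapter would use.

    /* Spawn one physical "child" request per extent that the virtual
     * [lba, lba+len) range overlaps; all names are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    struct extent {                 /* one entry of the logical extent map */
        uint64_t virt_lba;          /* first virtual LBA covered            */
        uint64_t len;               /* number of blocks                     */
        int      phys_target;       /* physical SCSI target                 */
        int      phys_lun;          /* physical SCSI logical unit           */
        uint64_t phys_lba;          /* first physical LBA                   */
    };

    static void spawn_children(const struct extent *map, int n,
                               uint64_t lba, uint64_t len)
    {
        uint64_t end = lba + len;
        for (int i = 0; i < n; i++) {
            uint64_t e_end = map[i].virt_lba + map[i].len;
            if (lba < e_end && end > map[i].virt_lba) {
                /* clamp the request to this extent */
                uint64_t s = lba > map[i].virt_lba ? lba : map[i].virt_lba;
                uint64_t e = end < e_end ? end : e_end;
                printf("child: target %d lun %d lba %llu len %llu\n",
                       map[i].phys_target, map[i].phys_lun,
                       (unsigned long long)(map[i].phys_lba +
                                            (s - map[i].virt_lba)),
                       (unsigned long long)(e - s));
            }
        }
    }

    int main(void)
    {
        struct extent map[] = {
            { 0,    1024, 1, 0, 5000 },
            { 1024, 1024, 2, 3, 0    },
        };
        spawn_children(map, 2, 1000, 100);  /* spans both extents */
        return 0;
    }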

If the request involves a Read, then information is retrieved from physical storage 308. If the SCSI IO request involves a Write, then information is written to physical storage 308. Subsequently, after the SCSI IO request (Read or Write) has been successfully performed, the VF 316, through its virtual target, places an IOV Completion Control Block on an IOV completion control block queue 331 for access by the VF driver 321. The VF 316 notifies the VF driver 321 of the addition of the new Completion Control Block on the Virtual IO Block Reply queue via an interrupt. As mentioned above, in some embodiments, a virtual SCSI initiator within the VF driver 321 and the virtual SCSI target within the VF 316 employ the SCSI Parallel Interface (SPI) transport protocol, which is not port based, in their communications with each other.

The format of the Virtual I/O Dispatch Control Block sent by the VF driver 321 is storage adapter vendor dependent and is expected to include information of the type set forth in Table 1. The SCSI initiator and SCSI target IDs are SCSI Parallel Interface transport addresses.

TABLE 1 (Virtual I/O Dispatch Control Block Issued by VF Driver)
  SCSI (R/W) Command
  SCSI Virtual Initiator ID
  SCSI Virtual Target ID
  SCSI Virtual Logical Unit Number (LUN)
  Serial Number (used to uniquely identify command for abort)
  VM Physical Memory Addresses for Read/Write Data exchange with Physical Storage

The VF 316 assembles and transmits to the storage region 308 a physical SCSI IO request. In some embodiments, the physical SCSI IO request comprises a Physical I/O Dispatch Control Block, as indicated by transition 508, for each region of physical storage to be accessed. The Physical I/O Dispatch Control Block is storage adapter vendor dependent and is expected to include information of the type set forth in Table 2. The SCSI initiator and SCSI target IDs are SCSI transport addresses specific to the SCSI transport protocol being used.

TABLE 2 (Physical I/O Dispatch Control Block Issued by VF)
  SCSI (R/W) Command
  SCSI Physical Initiator ID
  SCSI Physical Target ID
  SCSI Physical Logical Unit Number (LUN)
  Serial Number (used to uniquely identify command for abort)
  VM Physical Memory Addresses for Read/Write Data exchange with Physical Storage
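
Tables 1 and 2 suggest control blocks of the following general shape. Since the actual format is stated to be vendor dependent, the field types, sizes, and ordering below are assumptions for illustration only.

    /* Illustrative renderings of Tables 1 and 2; field layout is an
     * assumption, as the real layout is vendor dependent. */
    #include <stdint.h>
    #include <stdio.h>

    struct virtual_io_dispatch_cb {     /* Table 1, issued by VF driver */
        uint8_t  cdb[16];               /* SCSI (R/W) command           */
        uint32_t virt_initiator_id;     /* SCSI virtual initiator ID    */
        uint32_t virt_target_id;        /* SCSI virtual target ID       */
        uint64_t virt_lun;              /* virtual logical unit number  */
        uint64_t serial;                /* unique ID, used for abort    */
        uint64_t guest_buf[8];          /* VM physical memory addresses */
    };

    struct physical_io_dispatch_cb {    /* Table 2, issued by VF        */
        uint8_t  cdb[16];
        uint32_t phys_initiator_id;
        uint32_t phys_target_id;
        uint64_t phys_lun;
        uint64_t serial;
        uint64_t guest_buf[8];
    };

    int main(void)
    {
        printf("%zu %zu\n", sizeof(struct virtual_io_dispatch_cb),
               sizeof(struct physical_io_dispatch_cb));
        return 0;
    }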

Assuming for the sake of this example that the Virtual IO Dispatch Control Block includes a ‘Read’ command, and that such command is successfully processed by the storage region 308, then the storage region responds by providing the requested data to the VF 316.

The provided data may comprise a plurality of data packets. A completion routing module 328 of the VF 316 receives the data packets from the storage region 308 and causes the received data to be sent to DMA logic 330 within the adapter 306. The DMA logic 330 cooperates with IO MMU logic 332 within the host machine 302 to read the data directly into a memory space of host machine physical memory (not shown) that has been allocated to the virtual machine 304 running the application 305 that originally requested the data. The DMA logic 330 stores retrieved data in the host memory address space specified in the Physical IO Dispatch Control Block. As explained above, a single virtual logical unit (i.e. virtual disk) may correspond to multiple regions of physical storage, and therefore a single Virtual I/O Dispatch Control Block can result in transmission of multiple Physical I/O Dispatch Control Blocks. The DMA logic 330 may employ a well-known ‘DMA scatter gather address’ technique to assemble all of the retrieved data into the specified host address space. The DMA logic 330 and the IO MMU logic 332 may be implemented in firmware or hardware, for example. Data that is in transit between the VM and the physical storage in response to the virtual IO request is stored temporarily in the specified host address space.
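
A simplified sketch of the scatter/gather assembly step follows, simulating guest memory with a plain array. The names and the flat address model are assumptions; a real adapter performs these copies in hardware through the IO MMU.

    /* Place each received fragment at its guest address, as the
     * adapter's DMA logic 330 would; purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct sg_entry {
        uint64_t guest_addr;  /* VM physical address (array offset here) */
        uint32_t len;         /* bytes for this fragment                 */
    };

    static void dma_scatter(uint8_t *guest_mem, const struct sg_entry *sg,
                            int n, const uint8_t *data)
    {
        for (int i = 0; i < n; i++) {
            memcpy(guest_mem + sg[i].guest_addr, data, sg[i].len);
            data += sg[i].len;
        }
    }

    int main(void)
    {
        uint8_t guest_mem[64] = { 0 };
        const struct sg_entry sg[] = { { 0, 4 }, { 32, 4 } };
        const uint8_t payload[8] =
            { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
        dma_scatter(guest_mem, sg, 2, payload);
        printf("%.4s %.4s\n", (char *)guest_mem,
               (char *)guest_mem + 32);      /* prints: ABCD EFGH */
        return 0;
    }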

It will be appreciated that newer host machines allow for IOV by providing isolation of VMs during DMA transfers from an external device (e.g. an IOV adapter) to host machine memory (not shown) and vice versa. More particularly, I/O MMU logic 332 provides mapping for DMA operations so as to isolate a VM's guest OS address space when providing direct assignment from such VM's guest OS to the VF. The IO MMU logic 332 provides memory isolation for the DMA to/from physical pages assigned to each VM.

Continuing with the assumption that a successful retrieval of data ensues, the storage region (i.e. physical SCSI target) 308 sends SCSI status information embedded within the SCSI command reply, indicated by transition 510, to the virtual port of the VF 316 on the physical adapter 306 upon completion of a Physical I/O Dispatch Control Block. That is, the storage region 308 sends a completion message upon completion of the transmission of all data requested (in the case of a Read) or upon receipt of all data that has been transmitted (in the case of a Write access) for all individual Physical I/O Dispatch Control Blocks that were sent.

The SCSI command completion includes information indicative of whether the dispatched SCSI command completed successfully, and if the SCSI command did not complete successfully, the SCSI command completion includes information indicative of the nature of the error (or exception) that resulted in the failure to properly complete. See SBC-3, Revision 15, Section 4.14, Error Reporting. As used herein, the term ‘error’ encompasses exceptions.

It is the responsibility of the VF's Completion Routing component 328 both to wait for all outstanding physical SCSI commands to physical storage spawned for a given virtual SCSI command to a virtual SCSI logical unit to complete before actually completing the virtual SCSI command to the virtual SCSI logical unit, and to inspect the completion message for each physical SCSI command to physical storage to determine whether any of the physical completions contains an indication of an error or of a recoverable error condition.

Assuming in this example that the module 328 determines that all of the spawned physical SCSI commands to physical storage have completed and finds no error or exception with any of the completing physical SCSI commands to physical storage, then the VF 316 queues the completion control block to the VF's reply queue indicating the SCSI status and SCSI sense codes for the completed command. The VF 316 issues an MSI or MSI-X physical interrupt using the MSI or MSI-X interrupt vector allocated and bound to the VF when the VF was instantiated, indicated by transition 512. The MSI or MSI-X interrupt is fielded by the host system's IO Advanced Programmable Interrupt Controller (IOAPIC) 334. The IOAPIC directs the interrupt to the host system's Interrupt Descriptor Table (IDT) of a particular physical processor of the host system. The IDT/IOAPIC directs the interrupt to the PF driver 318, as indicated by transition 514, which recognizes that the physical interrupt was issued by the VF 316 by detecting the MSI/MSI-X interrupt vector over which the interrupt was received. The PF driver 318, in turn, informs the HSA 320 of the interrupt, as indicated by transition 516. The HSA 320 sends a virtual interrupt to the VF driver, as indicated by transition 518, which registered a virtual interrupt service routine (ISR) to indicate successful completion of the Physical I/O Dispatch Control Block. In the course of processing the interrupt, the VF driver 321 informs the requesting application 305, as indicated by transition 520. The VM 304, through the VF driver 321, for example, performs completion processing, which may involve processes such as de-queuing stored IO structures, providing access to Read data on memory pages allocated to an IO request by upper software layers, releasing IO structures, and releasing memory pages, for example.
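
The wait-for-all-children completion rule can be sketched as follows. The names are hypothetical, and ordering SCSI statuses by numeric value to pick a "worst" status is a simplification for illustration.

    /* The virtual command completes only after every spawned physical
     * command has completed; the worst status seen is reported. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct virt_cmd {
        uint64_t serial;        /* serial number from Table 1          */
        int      outstanding;   /* physical children not yet completed */
        int      worst_status;  /* 0 = GOOD, 2 = CHECK CONDITION, ...  */
    };

    /* Called once per physical completion (transition 510); returns
     * true when the virtual command may be completed to the VF driver
     * via the reply queue and an MSI/MSI-X interrupt (transition 512). */
    static bool on_phys_completion(struct virt_cmd *vc, int scsi_status)
    {
        if (scsi_status > vc->worst_status)
            vc->worst_status = scsi_status;
        return --vc->outstanding == 0;
    }

    int main(void)
    {
        struct virt_cmd vc = { 7, 3, 0 };   /* three children spawned */
        on_phys_completion(&vc, 0);
        on_phys_completion(&vc, 0);
        if (on_phys_completion(&vc, 0))
            printf("virtual command %llu complete, status %d\n",
                   (unsigned long long)vc.serial, vc.worst_status);
        return 0;
    }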

Virtual interrupts may be coalesced as described in commonly assigned U.S. patent application Ser. No. 12/687,999, invented by H. Subramanian, et al., entitled Guest/Hypervisor Interrupt Coalescing for Storage Adapter Virtual Function in Guest Passthrough Mode, filed on even date herewith, which is expressly incorporated herein by this reference.

One reason for routing the physical interrupt to a virtual interrupt within the virtualization layer 310 is the absence of host hardware resources to reliably provide interrupt isolation between multiple virtual machines fielding interrupts. Such isolation would allow a physical interrupt generated by a VF 316 to be handled directly by an interrupt service routine previously registered by a VM in a virtualized instance of the IDT allocated to that VM and invoked by the host system's IOAPIC. With sufficient hardware support, a physical interrupt could be sent directly to the VF driver 321 to report the completion.

IOV Access with Recoverable Error Condition Identified by VF that Can Be Corrected by the Virtualization Intermediary

FIG. 6 is an illustrative transition diagram that illustrates process flow during an IOV Read/Write operation by the system of FIG. 3 in which an error is identified by a virtual function. Transitions shown within FIG. 6 that are identical to transitions described with reference to FIG. 5 are labeled with identical reference numerals and are not further described. An example of a recoverable error condition that may be discovered by the VF 316 is an attempted access to an unallocated portion of a virtual SCSI logical unit (i.e. virtual disk), which may arise due to dynamic provisioning of a virtual SCSI logical unit allocated to the virtual machine seeking such access. Persons skilled in the art will understand that one approach to sharing storage resources among multiple virtual machines involves provisioning a virtual SCSI logical unit with only a small amount of the physical storage that actually has been allocated, with the intent of allocating additional portions of the virtual SCSI logical unit later when there is an actual request for access within an unallocated portion of the virtual SCSI logical unit.

A SCSI target's logical extent map 317 communicated to the VF 316 indicates which portions of a virtual disk are allocated and which are unallocated. Basically, a gap in the logical sequence of the virtual disk corresponds to an unallocated region of virtual storage. Thus the VF 316 can recognize when a SCSI CDB contains a SCSI LBA and length field that refers to any portion of an unallocated region of a virtual SCSI logical unit. The VF 316 generates an error condition, which in the case of a request for an unallocated portion of a virtual logical unit (i.e. virtual disk) is an example of a recoverable error condition.
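
A sketch of the allocation check follows, assuming a sorted, non-overlapping extent map with hypothetical field names; a request is flagged as a recoverable error if any part of its LBA range falls in a gap.

    /* Detect whether [lba, lba+len) touches an unallocated gap of
     * the virtual disk; names and layout are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct vextent {          /* one allocated run of the virtual disk */
        uint64_t virt_lba;
        uint64_t len;
    };

    static bool touches_unallocated(const struct vextent *map, int n,
                                    uint64_t lba, uint64_t len)
    {
        uint64_t pos = lba, end = lba + len;
        for (int i = 0; i < n && pos < end; i++) {
            uint64_t s = map[i].virt_lba, e = s + map[i].len;
            if (pos < s)
                return true;          /* gap before this extent */
            if (pos < e)
                pos = e;              /* covered through this extent */
        }
        return pos < end;             /* range runs past the last extent */
    }

    int main(void)
    {
        /* blocks 0-999 and 2000-2999 allocated; 1000-1999 is a gap */
        const struct vextent map[] = { { 0, 1000 }, { 2000, 1000 } };
        printf("%d %d\n",
               touches_unallocated(map, 2, 100, 100),   /* 0: allocated */
               touches_unallocated(map, 2, 900, 200));  /* 1: hits gap  */
        return 0;
    }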

Upon the VF's discovery of a reference to an unallocated region of a virtual disk for a Virtual I/O Dispatch Control Block, the VF 316 queues the Block, thereby suspending the Virtual I/O Dispatch Control Block, and communicates the error condition to the PF 314 via the interconnect circuitry 324, indicated by transition 602. The PF 314 creates an IO Exception Control Block, which it adds to the PF's IO completion control block queue 342. The PF 314 generates a new unsolicited MSI/MSI-X interrupt, represented by transition 604, to the IDT/IOAPIC 334. The invocation of the interrupt service routine is represented by transition 606. This interrupt may utilize the same MSI/MSI-X vector utilized for notifying the PF driver 318 of IO completions for SCSI commands previously submitted by the PF driver, or it may be a separate MSI/MSI-X vector dedicated for this purpose. If the former, the PF driver 318 should be capable of distinguishing between IO completion control blocks for previously dispatched SCSI commands and an IO exception control block.

In response to the interrupt, the PF driver 318 de-queues the aforementioned IO Exception Control Block from the PF IO completion queue 342, inspects the Block to determine whether it is for a VF exception condition, and forwards the exception condition to the virtual SCSI target emulation control plane 322, which had previously registered a callback routine with the PF driver 318 for this exact usage. The invocation of such routine in the virtual SCSI target emulation control plane 322 is represented by transition 608. The SCSI target emulation control plane 322 responds to the reference-to-unallocated-region recoverable error condition both by searching for sufficient available storage to satisfy the IO request in the file system from which the affected virtual disk was originally allocated and by causing a modification of the file system's first mapping metadata 361 for the virtual disk file to change the physical region allocation of the storage to that particular virtual disk file according to the received virtual SCSI IO request.

The modified virtual SCSI logical unit logical extent map is communicated to first mapping metadata persistent storage 650, which is part of persistent storage, via the PF driver 318 and the PF 314, as indicated by transitions 610, 612 and 614. Accordingly, the metadata storage 650 updates the block allocation map of the file system containing the affected virtual disk file. Once the modifications to the file system's metadata are on persistent storage, updated contents of the mapping 317 provided by the virtual SCSI logical unit are communicated by the SCSI target emulation 322 to the PF driver 318, as indicated by transition 616, which, in turn, communicates the new mapping information to the PF 314, as indicated by transition 618. In transition 620, the PF 314 communicates the new mapping via the interconnect circuitry 324 to the VF 316.

Before the VF 316 resumes processing of the suspended Virtual I/O Dispatch Control Block from the dispatch queue, it informs the SCSI virtual target emulation 322 of the successful receipt of the new map information and awaits receipt of a resume instruction. In respective transitions 622-626, IO completion information originating with the VF 316 passes from the VF to the PF (transition 622). The PF sends an interrupt message to the IDT/IOAPIC (transition 624), which in turn invokes an interrupt service routine of the PF driver 318 (transition 626). The PF driver invokes an IO completion callback registered by the SCSI target emulation control plane 322 (transition 628).

In respective transitions 630-634, a resume IO control message is passed from the SCSI virtual target emulation 322 to the PF driver 318, from the PF driver to the PF 314, and from the PF to the VF 316 via the interconnect circuitry 324. As indicated by transitions 505 and 508, the VF uses the newly modified SCSI virtual target emulation mapping that it has received to map (transition 505) the Virtual I/O Dispatch Control Block to one or more Physical I/O Dispatch Control Blocks and transmits (transition 508) the latter Blocks to the storage region 308. The storage region 308 responds to the one or more Physical I/O Dispatch Control Blocks.

Following the reception of responses for all Physical I/O Dispatch Control Blocks by the completion routing module 328, the completion is processed according to transitions 510-520 described with reference to FIG. 5.

IOV Access with Non-Recoverable Error Condition Identified by VF that Cannot Be Corrected by the Virtualization Intermediary

Referring to FIG. 5, at transition 504 the VF 316 may discover an error that cannot be corrected. For example, the VF driver 321 may refer to a virtual SCSI target that does not exist. That is, a virtual SCSI target or virtual SCSI logical unit referred to in a Virtual I/O Dispatch Control Block has not yet been, or may never be, instantiated within the virtual target emulation mapping on the VF 316. Upon discovery of an uncorrectable error, the VF may simply not respond to the I/O, as should be the case for the non-existent SCSI target use case. Alternatively, as should be the case for the non-existent/non-mapped SCSI LUN use case, the VF may form an IO Completion Control Block with the appropriate SCSI status and SCSI sense data filled in according to SBC-3, Revision 15, Section 4.14 and SPC-3, Revision 23, queue the control block to the VF completion queue, and generate an interrupt using the VF's MSI/MSI-X vector.

IOV Access with Error Condition Identified by Storage

Referring to FIG. 5, at transition 510 a physical SCSI target within storage 308 may report an unrecoverable error, such as a physical SCSI LUN that has not been mapped to a physical SCSI logical unit or a SCSI medium error. In the case of an unrecoverable error, the VF 316 creates an IO Completion Control Block and reports the error to the VF driver 321. In the case of a recoverable error, such as a SCSI check condition or a SCSI reservation conflict error, the VF 316 reports the error to the virtualization intermediary via the PF 314 and the PF driver 318. The virtualization intermediary 310 may take steps to resolve the error and communicate a resolution to the VF 316 via the PF driver 318 and the PF 314, or it may ignore the error and instruct the VF 316 to ‘resume’, that is, to continue to return an uncorrectable error to the VF driver via an IO Completion Control Block, for example.

SCSI Target Reference Model

As described above, a collaborative SCSI target emulation in accordance with some embodiments is provided, which resides across both a PCIe SR-IOV virtual function and a virtualization intermediary. In accordance with some embodiments, the emulated SCSI targets are accessible only via SCSI initiator devices residing in virtual machines and managed by virtualization-savvy para-virtualized virtual function drivers. In other words, virtual SCSI targets are accessed only by a virtual SCSI initiator embedded within the HSA 320, not by physical SCSI initiators. The VF driver acquires virtual initiator information from the HSA 320 and submits this information through the HSA memory spaces.

Moreover, in accordance with some embodiments, there are limits upon the scope of a virtual SCSI initiator. While a single VF may house multiple virtual SCSI target devices, each of these virtual SCSI target devices may be accessed by only a single virtual SCSI initiator device. Each virtual SCSI initiator device may of course access multiple virtual SCSI targets co-located in the same VF. Any given SCSI logical unit is accessible only via a single VF SCSI target device. Recall that a virtual SCSI initiator is used for communication between a VM and a VF and not for communication by the VF with a physical SAN over a physical network, for example.

A VF's SCSI target device does not support the “head of queue” and “ordered” SCSI Tagged Command Queuing modes, since these commands are not required by the reference model herein. Moreover, SCSI Reserve/Release commands and SCSI Persistent Group Reservations are neither supported nor needed for a SCSI target accessed by but a single SCSI initiator. It will be appreciated that SCSI Reserve/Release commands, which are used to reserve and release logical units, are required when there are multiple possible initiators or where a logical unit is available to multiple targets. However, since the reference model herein is limited to allowing only one virtual SCSI initiator to access a virtual SCSI logical unit through one virtual SCSI target, there is no need for the Reserve/Release commands.

As explained above, in some embodiments, a VF's SCSI target services only 6-, 10-, 12-, and 16-byte SCSI read and write commands. In addition, the VF's SCSI target will service SCSI task abort and SCSI (bus, target, LUN) reset task management requests. SCSI task management is supported via the use of a typed task management control block.

The VF direct access path supports aborts. A SCSI task (that is, a single SCSI read or write command, since linked/tagged SCSI commands are not supported) directed to a VF SCSI target can be aborted by queuing a Virtual Task Management Control Block Request to the Virtual IO Block Request queue of the VF's SCSI target device. The task management control block type must be set to indicate a task abort, and the control block must indicate a storage adapter dependent sequence number used to uniquely identify the Virtual IO Dispatch Control Block of the SCSI task to abort, which was previously queued to the Virtual IO Block Request queue of the same SCSI target. All outstanding virtual SCSI commands/tasks (that is, ones which have not yet been completed) must be tracked. Furthermore, the tracking mechanism must provide a way to uniquely identify each of the possibly multiple physical SCSI commands which were spawned as a result of servicing the virtual SCSI command to be aborted. Servicing the task abort of a virtual SCSI command must involve successfully aborting each of the physical SCSI commands resulting from servicing the single virtual SCSI command. The abort task request can be completed only after each of these aborted physical SCSI commands has completed and the aborted virtual SCSI command has been completed with a SCSI TASK ABORTED status.
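
The serial-number tracking used for task aborts might be sketched as follows, using hypothetical names and a fixed-size table for brevity.

    /* Every physical child carries the serial number of the virtual
     * command that spawned it, so a task abort can find and abort
     * each child; illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_PHYS 64

    struct phys_cmd {
        uint64_t serial;        /* unique physical command serial  */
        uint64_t parent_serial; /* virtual command that spawned it */
        bool     outstanding;   /* not yet completed               */
    };

    static struct phys_cmd cmd_table[MAX_PHYS];

    /* Abort every outstanding physical child of the given virtual
     * command; the task abort completes only after all of them have. */
    static int abort_virtual_cmd(uint64_t virt_serial)
    {
        int aborted = 0;
        for (int i = 0; i < MAX_PHYS; i++) {
            if (cmd_table[i].outstanding &&
                cmd_table[i].parent_serial == virt_serial) {
                cmd_table[i].outstanding = false; /* issue abort */
                aborted++;
            }
        }
        return aborted;
    }

    int main(void)
    {
        cmd_table[0] = (struct phys_cmd){ 100, 7, true };
        cmd_table[1] = (struct phys_cmd){ 101, 7, true };
        cmd_table[2] = (struct phys_cmd){ 102, 8, true };
        printf("aborted %d children of virtual command 7\n",
               abort_virtual_cmd(7));            /* prints 2 */
        return 0;
    }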

SCSI bus reset, target reset, and LUN reset are three classes of reset task management operations that are serviced by mapping them to the appropriate set of SCSI task abort operations. A reset ultimately is mapped to an abort. Each virtual SCSI bus consists of but a single SCSI initiator and a single SCSI target. Therefore, a SCSI bus reset is mapped to resetting a single SCSI target. A SCSI target may manage multiple logical units. Therefore, a SCSI target reset is mapped to resetting each of the logical units managed by that SCSI target. Each SCSI logical unit may support multiple outstanding SCSI commands/tasks. Therefore, a SCSI LUN reset is mapped to aborting each outstanding SCSI command/task for the SCSI logical unit mapped to that SCSI LUN for the specified SCSI target.

As shown above in Tables 1 and 2, each SCSI command includes a unique command identifier so that a virtual SCSI target on a VF can distinguish one command from another on the same virtual SCSI logical unit on the same virtual SCSI target, so as to determine which command to abort, for example. It is possible that there may be a desire to abort just one SCSI command out of perhaps thousands of SCSI commands.

While Auto Contingent Allegiance (ACA) conditions are not supported by a VF's SCSI target device, coherent servicing of SCSI unit attention check conditions requires consistent awareness of the existence of the SCSI unit attention check condition, and of the specific SCSI sense data associated with it, across both the SCSI target emulation module 322 in the virtualization intermediary and the SCSI target emulation residing in the VF 316. Whenever a SCSI logical unit either enters into or exits from a unit attention state, it is necessary that both components of the collaborative SCSI target emulation be aware of the change in unit attention state for the SCSI logical unit. A mechanism is provided to alert the virtualization intermediary whenever the VF's SCSI target device discovers a SCSI unit attention condition for any of the SCSI logical units it manages. Mechanisms to share a SCSI logical unit's unit attention state across the virtualization intermediary and the VF's SCSI target device are also required (1) when the virtualization intermediary discovers that a logical unit enters a unit attention state and (2) whenever a logical unit transitions out of a unit attention condition.
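
A small sketch, with hypothetical names, of publishing unit attention transitions so that both halves of the collaborative emulation stay consistent:

    /* Either side that observes a unit attention transition publishes
     * it to its peer; the shared record below is illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct ua_state {
        bool    active;        /* unit attention pending?           */
        uint8_t sense_key;     /* 0x06 = UNIT ATTENTION             */
        uint8_t asc, ascq;     /* additional sense code / qualifier */
    };

    /* One record per SCSI logical unit, shared (conceptually) by the
     * VF target emulation and the intermediary's module 322. */
    static struct ua_state lu_ua;

    static void enter_unit_attention(uint8_t asc, uint8_t ascq)
    {
        lu_ua = (struct ua_state){ true, 0x06, asc, ascq };
        printf("alert peer: UA set, asc/ascq %02x/%02x\n", asc, ascq);
    }

    static void clear_unit_attention(void)
    {
        lu_ua.active = false;
        printf("alert peer: UA cleared\n");
    }

    int main(void)
    {
        enter_unit_attention(0x29, 0x00);  /* power on/reset occurred */
        clear_unit_attention();
        return 0;
    }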

Tracking SCSI Commands through Multiple Levels

SCSI command tracking to associate a virtual SCSI command with the multiple physical SCSI commands spawned by such virtual SCSI command is well known. For example, a virtual SCSI command may seek access to information that is disposed across different physical SCSI logical units of two different physical SCSI targets. In that case, an individual physical SCSI command may be spawned from the single virtual SCSI command for each of the two different physical SCSI targets. A completion for the virtual SCSI command is not generated until all physical SCSI commands produced in response to such virtual SCSI command have completed. SCSI command tracking has been used in the past to determine when all physical SCSI commands spawned by such virtual SCSI command have completed.

Tracking also is used in some embodiments for resets and to abort SCSI commands. For instance, it is well known that when a virtual SCSI command fails to complete within a timeout period, all outstanding physical SCSI commands spawned by that virtual SCSI command are aborted. The failure to complete may be due to a failed connection between a storage adapter and a physical SCSI initiator or due to a transport failure with the physical SCSI target, for example. Assuming that these aborts occur successfully, the virtual SCSI command may be retransmitted, resulting once again in the spawning of multiple physical SCSI commands. Typically, if an abort of one or more physical SCSI commands fails, then the physical SCSI logical unit associated with the physical SCSI command is reset. Ordinarily, if the SCSI logical unit reset succeeds, then the virtual SCSI command may be retransmitted, or an error may be reported to an application that generated the SCSI command, for example. If, on the other hand, the SCSI logical unit reset fails, then a physical SCSI target associated with the physical SCSI logical unit is reset, for example.

FIG. 7 is an illustrative drawing of a mapping process 700 run within the VF 316 to track physical SCSI commands associated with a virtual SCSI command for completion processing and for processing aborts and resets in accordance with some embodiments of the invention. Adapter resources 339, which include processing resources and memory device resources, are configured to act as a particular machine to implement the mapping process 700. The mapping process 700 involves use of a first level mapping 702 between a given virtual SCSI command and all physical SCSI commands spawned by such virtual SCSI command. The mapping process 700 involves use of a second level mapping 704 between a given virtual SCSI logical unit and all virtual SCSI commands associated with the virtual SCSI logical unit. The mapping process 700 involves use of a third level mapping 706 between a virtual SCSI target and all virtual SCSI logical units managed by the virtual SCSI target. The mapping process 700 involves use of a fourth level mapping 708 between the virtual SCSI target and a virtual SCSI bus. As explained above, however, the HSA 320 has only a single virtual SCSI initiator, and thus one virtual SCSI bus.

The mappings 702-708 may comprise mapping structures such as tables that are disposed in a memory space associated with the VF 316. Note that in FIG. 7, multiple instances of the mappings 702-708 are shown. A first level mapping 702 is created on the VF 316 when a physical SCSI command is spawned from a virtual SCSI command. For example, if a given virtual SCSI command spawns five physical SCSI commands, then five unique serial numbers are produced for the five new physical SCSI commands, and the first level mapping 702 uses those serial numbers to map the physical SCSI commands to the virtual SCSI command that spawned them. The first level mapping 702 for a given virtual SCSI command is deleted upon completion of the virtual SCSI command (i.e. upon completion of all physical SCSI commands spawned by it). A second level mapping 704 is created on the VF 316 when a virtual command is issued for a given virtual SCSI logical unit. The second level mapping 704 maps the virtual SCSI command to a virtual SCSI logical unit. A third level mapping 706 is created on the VF 316 when a new logical unit is created. The third level mapping maps the virtual SCSI logical unit to a virtual SCSI target. As explained above, the fourth level mapping 708 maps the single virtual SCSI target to the virtual SCSI initiator.

The process 700 begins when an abort or a reset command is sent in block 701 over the first memory space 311 of the HSA 320. In decision block 710, a determination is made as to the level at which a mapping is required. When decision block 710 determines that a given virtual SCSI command is to be aborted, control flows to a first branch 712, which uses a first level mapping 702 to identify the physical SCSI commands associated with the virtual SCSI command to be aborted. Each physical command associated with the given virtual SCSI command is aborted.

When decision block 710 determines that a given virtual logical unit is to be reset, control flows to a second branch 714, which uses a second level mapping 704 to identify each virtual SCSI command associated with the virtual SCSI logical unit to be reset. Next, for each identified virtual SCSI command, a first level mapping 702 is used to identify the physical SCSI commands associated with the virtual SCSI command. Each physical SCSI command associated with an identified virtual SCSI command is aborted.

When decision block 710 determines that a given virtual target is to be reset, control flows to a third branch 716, which uses a third level mapping 706 to identify each virtual SCSI logical unit associated with the virtual SCSI target to be reset. Next, for each identified virtual SCSI logical unit, a second level mapping 704 is used to identify each virtual SCSI command associated with the virtual SCSI logical unit. Following that, for each identified virtual SCSI command, a first level mapping 702 is used to identify each physical SCSI command associated with the virtual SCSI command. Each physical SCSI command associated with an identified virtual SCSI command is aborted.

When decision block 710 determines that a given virtual SCSI bus is to be reset, control flows to a fourth branch 718, and processing occurs in much the same way as in the third branch 716, since there exists only one virtual SCSI target for the virtual SCSI initiator (bus).
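
The four-level fan-out of FIG. 7 can be sketched with nested structures. The fixed-size arrays and names below are illustrative assumptions; a real implementation would use the adapter's mapping tables 702-708.

    /* Bus reset -> target reset -> LUN resets -> virtual command
     * aborts -> physical command aborts (first level mapping 702). */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX 8

    struct vcmd { int n_phys; uint64_t phys[MAX]; };    /* mapping 702 */
    struct vlun { int n_cmds; struct vcmd cmds[MAX]; }; /* mapping 704 */
    struct vtgt { int n_luns; struct vlun luns[MAX]; }; /* mapping 706 */
    struct vbus { struct vtgt tgt; };                   /* mapping 708 */

    static void abort_vcmd(const struct vcmd *c)
    {
        for (int i = 0; i < c->n_phys; i++)
            printf("abort physical command %llu\n",
                   (unsigned long long)c->phys[i]);
    }

    static void reset_lun(const struct vlun *l)
    {
        for (int i = 0; i < l->n_cmds; i++)
            abort_vcmd(&l->cmds[i]);
    }

    static void reset_target(const struct vtgt *t)
    {
        for (int i = 0; i < t->n_luns; i++)
            reset_lun(&t->luns[i]);
    }

    static void reset_bus(const struct vbus *b)
    {
        reset_target(&b->tgt);  /* one initiator, one target per bus */
    }

    int main(void)
    {
        struct vbus bus = { { 1, { { 1, { { 2, { 100, 101 } } } } } } };
        reset_bus(&bus);        /* aborts physical commands 100, 101 */
        return 0;
    }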

The foregoing description and drawings of embodiments in accordance with the present invention are merely illustrative of the principles of the invention. Therefore, it will be understood that various modifications can be made to the embodiments by those skilled in the art without departing from the spirit and scope of the invention, which is defined in the appended claims.

1. In a system that includes a host computing machine configured to implement a virtualization intermediary and a virtual machine (VM) and that includes a storage adapter, a method of VM access to physical storage through a direct path to a virtual function (VF) of the storage adapter, comprising: sending from a VF driver of the VM to the VF a virtual SCSI IO request that identifies a virtual SCSI transport address; mapping within the VF the identified virtual disk address to a region of the physical storage; creating within the VF a physical SCSI IO request that identifies a physical address for the mapped-to physical region; sending the physical SCSI IO request from the VF to the physical storage.
2. The method of claim 1, wherein the virtual SCSI IO request includes a virtual IO dispatch control block.
3. The method of claim 1, wherein the physical SCSI IO request includes a physical IO dispatch control block.
4. The method of claim 1, wherein the identified virtual address identifies a virtual target and a virtual logical unit; and wherein the virtual SCSI IO request further includes a logical extent map for the identified virtual logical unit.
5. The method of claim 1, further including: receiving in a memory space of a memory device within the host system data in transit between the VM and the physical storage in response to the virtual IO request.
6. The method of claim 5, wherein the virtual SCSI IO request comprises a request to read data from the physical storage; and wherein the received data is in transit from physical storage to the VM.
7. The method of claim 5, wherein the virtual SCSI IO request comprises a request to write data to physical storage; and wherein the received data is in transit from the VM to physical storage.
8. The method of claim 1, further including: receiving within the VF from the physical storage an IO completion message.
9. The method of claim 8, further including: generating an interrupt to inform the VF driver of the receipt of the IO completion message.
10. The method of claim 8, further including: issuing a physical interrupt by the VF in response to receipt of the IO completion message; wherein the physical interrupt informs a PF driver within the virtualization intermediary of the IO completion message; the PF driver and the virtualization intermediary causing generation of a virtual interrupt to inform the VF driver of the receipt of the IO completion message.
11. The method of claim 10, further including: processing an IO completion within the VM in response to the virtual interrupt.
12. The method of claim 10, wherein causing generation of a virtual interrupt includes: the PF driver informing a hybrid storage adapter associated with the VM of the physical interrupt, and the hybrid storage adapter generating the virtual interrupt.
13. The method of claim 8, further including: inspecting by the VF the IO completion for an error condition; communicating a discovered error condition from the VF to a physical function (PF) of the physical adapter; communicating the error condition from the PF to the virtualization intermediary.
14. The method of claim 1, further including: inspecting by the VF the virtual SCSI IO request for a recoverable error condition; suspending the virtual SCSI IO request in response to discovery of a recoverable error condition; communicating the discovered error condition from the VF to the virtualization intermediary; correcting the error condition within the virtualization intermediary; communicating the error condition correction from the virtualization intermediary to the VF; and unsuspending the virtual SCSI IO request in response to the correction of the error condition.
15. The method of claim 14, wherein communicating the discovered error condition from the VF to the virtualization intermediary includes communicating the discovered error condition from the VF to a physical function (PF) of the storage adapter and from the PF to a PF driver of the virtualization intermediary.
16. The method of claim 15, wherein communicating the discovered error condition from the PF to the PF driver includes the PF creating an IO Exception Control Block and issuing an interrupt to inform the PF driver of the IO Exception Control Block; and further including: the PF driver accessing the Exception Control Block and informing a virtual SCSI target emulation control plane within the virtualization intermediary of the discovered error condition in response to the interrupt.
17. The method of claim 14, wherein communicating the error condition correction from the virtualization intermediary to the VF includes communicating the error condition correction from the PF driver to the PF and from the PF to the VF.
18. The method of claim 1, further including: inspecting by the VF the virtual SCSI IO request for an error condition involving an unallocated region of the virtual disk identified by the virtual address in the virtual IO request; suspending the virtual SCSI IO request in response to discovery of a recoverable error condition; communicating the discovered error condition from the VF to the virtualization intermediary; changing by the virtualization intermediary an allocation of the physical storage to the identified virtual disk; communicating the changed allocation from the virtualization intermediary to the VF; and unsuspending the virtual SCSI IO request in response to the changed allocation.
19. The method of claim 18, wherein communicating the discovered error condition from the VF to the virtualization intermediary includes communicating the discovered error condition from the VF to a physical function (PF) of the storage adapter and from the PF to a PF driver of the virtualization intermediary.
20. The method of claim 18, wherein communicating the changed allocation from the virtualization intermediary to the VF includes communicating the changed allocation from the PF driver to the PF and from the PF to the VF.
21. The method of claim 18, wherein communicating the discovered error condition from the VF to the virtualization intermediary includes communicating the discovered error condition from the VF to a physical function (PF) of the storage adapter and from the PF to a PF driver of the virtualization intermediary; and wherein communicating the changed allocation from the virtualization intermediary to the VF includes communicating the changed allocation from the PF driver to the PF and from the PF to the VF.
22. The method of claim 18, wherein communicating the changed allocation includes transmitting an updated mapping from the virtualization intermediary to the VF for use in mapping the virtual disk addresses to physical regions of the physical storage.
23. In a system that includes a host computing machine configured to implement a virtualization intermediary and a virtual machine (VM) and that includes a storage adapter, a method of VM access to physical storage through a direct path to a virtual function (VF) of the storage adapter, comprising: sending from a VF driver of the VM to the VF a virtual SCSI IO request that identifies a virtual SCSI transport address; mapping within the VF the identified virtual disk address to multiple physical regions of the physical storage; creating by the VF multiple different respective physical SCSI IO requests that identify different respective physical addresses for each of the multiple mapped-to physical regions; sending from the VF to the physical storage the multiple physical SCSI IO requests.
24. The method of claim 23, wherein each respective physical SCSI IO request includes an identifier that indicates the virtual SCSI IO request that spawned it.
25. The method of claim 24, further including: receiving by the VF a request to abort the virtual SCSI IO request; in response to the abort request, using the respective request identifiers to identify the respective physical SCSI IO requests spawned by the aborted virtual SCSI IO request; and aborting the respective identified physical SCSI IO requests.
26. In a system that includes a host computing machine configured to implement a virtualization intermediary and a virtual machine (VM) and that includes a storage adapter, a method of VM access to physical storage through a direct path to a virtual function (VF) of the storage adapter, comprising: providing a hybrid storage adapter (HSA) that includes a first HSA memory space that provides access from the VM directly to the VF and that includes a second HSA memory space that provides access from the VM to a physical function (PF) of the storage adapter with virtualization intermediary intervention; sending a message by the hybrid storage adapter that causes the virtualization intermediary to transmit a copy of mapping information for the virtual disks allocated to the VM from the virtualization intermediary to the PF and from the PF over the physical storage adapter to the VF; sending from a VF driver of the VM via the hybrid storage adapter to the VF a virtual SCSI IO request that identifies a virtual SCSI transport address; using the mapping transmitted to the VF to map the identified virtual disk address to at least one physical region of the physical storage; creating by the VF a physical SCSI IO request that identifies a physical address for the at least one mapped-to physical region; sending the physical SCSI IO request from the VF to the physical storage.